Exploratory Data Analysis Basics

Exploratory Data Analysis Basics#

EDA Definition#

Razones para hacer un EDA#

Organizar y entender las variables
Establecer relaciones entre las variables
Encontrar patrones ocultos en los datos
Ayuda a escoger el modelo correcto para la necesidad correcta
Tomar decisiones informadas

Pasos de EDA#

Tipos de Analitica de datos#

Project#

Environment Setup#

Library installation#

!pip install --upgrade pip

!pip install palmerpenguins==0.1.4 numpy==1.23.4 pandas==1.5.1 seaborn==0.12.1 matplotlib==3.6.0 empiricaldist==0.6.7 statsmodels==0.13.5 scikit-learn==1.1.2 pyjanitor==0.23.1 session-info

Requirement already satisfied: pip in c:\users\admin\miniconda3\envs\jupyter_notebooks_from_deepnote\lib\site-packages (23.2.1)

^C
Collecting palmerpenguins==0.1.4
  Using cached palmerpenguins-0.1.4-py3-none-any.whl (17 kB)
Collecting numpy==1.23.4
  Using cached numpy-1.23.4-cp311-cp311-win_amd64.whl (14.6 MB)
Collecting pandas==1.5.1
  Using cached pandas-1.5.1-cp311-cp311-win_amd64.whl (10.3 MB)
Collecting seaborn==0.12.1
  Using cached seaborn-0.12.1-py3-none-any.whl (288 kB)
Collecting matplotlib==3.6.0
  Using cached matplotlib-3.6.0-cp311-cp311-win_amd64.whl (7.2 MB)
Collecting empiricaldist==0.6.7
  Using cached empiricaldist-0.6.7.tar.gz (11 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting statsmodels==0.13.5
  Using cached statsmodels-0.13.5-cp311-cp311-win_amd64.whl (9.0 MB)
Collecting scikit-learn==1.1.2
  Using cached scikit-learn-1.1.2.tar.gz (7.0 MB)
  Installing build dependencies: started

Library import#

import empiricaldist
import janitor
import matplotlib.pyplot as plt
import numpy as np
import palmerpenguins
import pandas as pd
import scipy.stats
import seaborn as sns
import sklearn.metrics
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats as ss
import session_info

Stablish Chart Settings#

%matplotlib inline
sns.set_style(style='whitegrid')
sns.set_context(context='notebook')
plt.rcParams['figure.figsize'] = (11, 9.4)

penguin_color = {
    'Adelie': '#ff6602ff',
    'Gentoo': '#0f7175ff',
    'Chinstrap': '#c65dc9ff'
}

Importing Dataset#

The dataset chosen for this project is the palmerpenguins dataset

The goal of the Palmer Penguins dataset is to replace the highly overused Iris dataset for data exploration & visualization.

The information contains size measurements, clutch observations, and blood isotope ratios for 344 adult foraging Adélie, Chinstrap, and Gentoo penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica.

Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research (LTER) Program.

Raw Data#

This package contains a raw dataset (literally contains the data as it was collected)

raw_df = palmerpenguins.load_penguins_raw()

raw_df

	studyName	Sample Number	Species	Region	Island	Stage	Individual ID	Clutch Completion	Date Egg	Culmen Length (mm)	Culmen Depth (mm)	Flipper Length (mm)	Body Mass (g)	Sex	Delta 15 N (o/oo)	Delta 13 C (o/oo)	Comments
0	PAL0708	1	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A1	Yes	2007-11-11	39.1	18.7	181.0	3750.0	MALE	NaN	NaN	Not enough blood for isotopes.
1	PAL0708	2	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N1A2	Yes	2007-11-11	39.5	17.4	186.0	3800.0	FEMALE	8.94956	-24.69454	NaN
2	PAL0708	3	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A1	Yes	2007-11-16	40.3	18.0	195.0	3250.0	FEMALE	8.36821	-25.33302	NaN
3	PAL0708	4	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N2A2	Yes	2007-11-16	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Adult not sampled.
4	PAL0708	5	Adelie Penguin (Pygoscelis adeliae)	Anvers	Torgersen	Adult, 1 Egg Stage	N3A1	Yes	2007-11-16	36.7	19.3	193.0	3450.0	FEMALE	8.76651	-25.32426	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
339	PAL0910	64	Chinstrap penguin (Pygoscelis antarctica)	Anvers	Dream	Adult, 1 Egg Stage	N98A2	Yes	2009-11-19	55.8	19.8	207.0	4000.0	MALE	9.70465	-24.53494	NaN
340	PAL0910	65	Chinstrap penguin (Pygoscelis antarctica)	Anvers	Dream	Adult, 1 Egg Stage	N99A1	No	2009-11-21	43.5	18.1	202.0	3400.0	FEMALE	9.37608	-24.40753	Nest never observed with full clutch.
341	PAL0910	66	Chinstrap penguin (Pygoscelis antarctica)	Anvers	Dream	Adult, 1 Egg Stage	N99A2	No	2009-11-21	49.6	18.2	193.0	3775.0	MALE	9.46180	-24.70615	Nest never observed with full clutch.
342	PAL0910	67	Chinstrap penguin (Pygoscelis antarctica)	Anvers	Dream	Adult, 1 Egg Stage	N100A1	Yes	2009-11-21	50.8	19.0	210.0	4100.0	MALE	9.98044	-24.68741	NaN
343	PAL0910	68	Chinstrap penguin (Pygoscelis antarctica)	Anvers	Dream	Adult, 1 Egg Stage	N100A2	Yes	2009-11-21	50.2	18.7	198.0	3775.0	FEMALE	9.39305	-24.25255	NaN

344 rows × 17 columns

processed data#

Dataset containing already processed data

df = palmerpenguins.load_penguins()

df

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	female	2007
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	2007
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	female	2007
...	...	...	...	...	...	...	...	...
339	Chinstrap	Dream	55.8	19.8	207.0	4000.0	male	2009
340	Chinstrap	Dream	43.5	18.1	202.0	3400.0	female	2009
341	Chinstrap	Dream	49.6	18.2	193.0	3775.0	male	2009
342	Chinstrap	Dream	50.8	19.0	210.0	4100.0	male	2009
343	Chinstrap	Dream	50.2	18.7	198.0	3775.0	female	2009

344 rows × 8 columns

Using the seaborn dataset#

Seaborn also contains this dataset, let’s see it:

sns.load_dataset("penguins")

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female
...	...	...	...	...	...	...	...
339	Gentoo	Biscoe	NaN	NaN	NaN	NaN	NaN
340	Gentoo	Biscoe	46.8	14.3	215.0	4850.0	Female
341	Gentoo	Biscoe	50.4	15.7	222.0	5750.0	Male
342	Gentoo	Biscoe	45.2	14.8	212.0	5200.0	Female
343	Gentoo	Biscoe	49.9	16.1	213.0	5400.0	Male

344 rows × 7 columns

Performing EDA#

Now let’s answer some questions, remember that the name of our dataframe is stored in df

What are the variables types for this dataset?#

You can use pandas attribute dtype to do this:

df.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
year                   int64
dtype: object

“object” is usually synonymous of categorical variables
“float64” is usually synonymous of Numerical continuous variables
“int64” is usually synonymous of Numerical discrete variables

How many variables of each type is present in the dataset?#

The value_counts() attribute can help

(
    df
    .dtypes
    .value_counts()
)

float64    4
object     3
int64      1
dtype: int64

How many variables and records are in the dataset?#

The .shape() function can help with this

df.shape

(344, 8)

344 rows
8 columns

That answers the question

Are there null values in the dataset?#

.isnull() returns boolean values
any() tells you if null values are present in a column

See the result below:

df.isnull().any()

species              False
island               False
bill_length_mm        True
bill_depth_mm         True
flipper_length_mm     True
body_mass_g           True
sex                   True
year                 False
dtype: bool

How many null values are there in each column?#

use .sum() to obtain the results

df.isnull().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
year                  0
dtype: int64

What is the total amount of null values in the dataset?#

call .sum() again in the previous result

df.isnull().sum().sum()

What is the null values proportion for each variable?#

let’s show it in percentage

( df.isnull().sum()/(df.shape[0]) )*100

species              0.000000
island               0.000000
bill_length_mm       0.581395
bill_depth_mm        0.581395
flipper_length_mm    0.581395
body_mass_g          0.581395
sex                  3.197674
year                 0.000000
dtype: float64

You can actually make a chart to represent this:

melted = df.isnull().melt()

melted

	variable	value
0	species	False
1	species	False
2	species	False
3	species	False
4	species	False
...	...	...
2747	year	False
2748	year	False
2749	year	False
2750	year	False
2751	year	False

2752 rows × 2 columns

melted.pipe(
    lambda x : sns.displot(data=x, y="variable", hue="value", multiple="fill", aspect=2)
)

<seaborn.axisgrid.FacetGrid at 0x7fa235ed23a0>

../_images/b8bbef47af0f20141c8afaecb45367bc3331c5c9e56918ecd55f354bddc9a9aa.png

Los valores nulos no llegan ni al 10%

What are the records that contain the null values in the dataset? (Visualize)#

First create the dataframe

nulls = df.isnull().transpose()
nulls

	0	1	2	3	4	5	6	7	8	9	...	334	335	336	337	338	339	340	341	342	343
species	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
island	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
bill_length_mm	False	False	False	True	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
bill_depth_mm	False	False	False	True	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
flipper_length_mm	False	False	False	True	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
body_mass_g	False	False	False	True	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
sex	False	False	False	True	False	False	False	False	True	True	...	False	False	False	False	False	False	False	False	False	False
year	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False

8 rows × 344 columns

Now the chart

plt.figure(figsize=(7, 7))
nulls.pipe(
    lambda x : sns.heatmap(data=x)
)

# Alternatively, use:
#df.isnull().transpose().pipe(lambda x : sns.heatmap(data=x))

<AxesSubplot: >

../_images/7cc0c5f8be1cd12ff2a5dc36cdaf4d6fdac3a7fba992c6740fb70d7df575a254.png

Here you can evidence:

The missing values (Not corresponding to sex) are associated with only two records
The sex missing values are consistant, we should not worry about that
You might want to delete those two records that contain the vast majority of missing values (Not corresponding to sex)

How many records will be lost if we delete all the missing values?#

df.dropna()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	female	2007
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	female	2007
5	Adelie	Torgersen	39.3	20.6	190.0	3650.0	male	2007
...	...	...	...	...	...	...	...	...
339	Chinstrap	Dream	55.8	19.8	207.0	4000.0	male	2009
340	Chinstrap	Dream	43.5	18.1	202.0	3400.0	female	2009
341	Chinstrap	Dream	49.6	18.2	193.0	3775.0	male	2009
342	Chinstrap	Dream	50.8	19.0	210.0	4100.0	male	2009
343	Chinstrap	Dream	50.2	18.7	198.0	3775.0	female	2009

333 rows × 8 columns

Note that now you have 333 rows, instead of the original 344

THIS COMMAND DOES NOT MODIFY YOUR ORIGINAL DATABASE!!!!

df.shape

(344, 8)

DO YOU SEE?

SO, LET’S SAVE IT into another variable ;)

df_clean = df.dropna()

df_clean

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	female	2007
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	female	2007
5	Adelie	Torgersen	39.3	20.6	190.0	3650.0	male	2007
...	...	...	...	...	...	...	...	...
339	Chinstrap	Dream	55.8	19.8	207.0	4000.0	male	2009
340	Chinstrap	Dream	43.5	18.1	202.0	3400.0	female	2009
341	Chinstrap	Dream	49.6	18.2	193.0	3775.0	male	2009
342	Chinstrap	Dream	50.8	19.0	210.0	4100.0	male	2009
343	Chinstrap	Dream	50.2	18.7	198.0	3775.0	female	2009

333 rows × 8 columns

Counts and Proportions#

What measures describe our dataset?#

The function .describe() can do this for us, but it will just do it for numerical variables.

To include strings use .describe(include=”all”)

df.describe(include="all")

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
count	344	344	342.000000	342.000000	342.000000	342.000000	333	344.000000
unique	3	3	NaN	NaN	NaN	NaN	2	NaN
top	Adelie	Biscoe	NaN	NaN	NaN	NaN	male	NaN
freq	152	168	NaN	NaN	NaN	NaN	168	NaN
mean	NaN	NaN	43.921930	17.151170	200.915205	4201.754386	NaN	2008.029070
std	NaN	NaN	5.459584	1.974793	14.061714	801.954536	NaN	0.818356
min	NaN	NaN	32.100000	13.100000	172.000000	2700.000000	NaN	2007.000000
25%	NaN	NaN	39.225000	15.600000	190.000000	3550.000000	NaN	2007.000000
50%	NaN	NaN	44.450000	17.300000	197.000000	4050.000000	NaN	2008.000000
75%	NaN	NaN	48.500000	18.700000	213.000000	4750.000000	NaN	2009.000000
max	NaN	NaN	59.600000	21.500000	231.000000	6300.000000	NaN	2009.000000

What if i want only numerical variables?

df.describe(include=[np.number])

	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	year
count	342.000000	342.000000	342.000000	342.000000	344.000000
mean	43.921930	17.151170	200.915205	4201.754386	2008.029070
std	5.459584	1.974793	14.061714	801.954536	0.818356
min	32.100000	13.100000	172.000000	2700.000000	2007.000000
25%	39.225000	15.600000	190.000000	3550.000000	2007.000000
50%	44.450000	17.300000	197.000000	4050.000000	2008.000000
75%	48.500000	18.700000	213.000000	4750.000000	2009.000000
max	59.600000	21.500000	231.000000	6300.000000	2009.000000

What if i want only categorical variables?

df.describe(include=object)

	species	island	sex
count	344	344	333
unique	3	3	2
top	Adelie	Biscoe	male
freq	152	168	168

BONUS: Transform an object into a category#

See how to transform a column with the data type “object into a category”

This allows cool things such as see the proportions

df.astype({
        "species":"category",       #just insert a dict where key is column name and value is the type
        "island":"category",
        "sex":"category"
    })

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	male	2007
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	female	2007
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	female	2007
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN	2007
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	female	2007
...	...	...	...	...	...	...	...	...
339	Chinstrap	Dream	55.8	19.8	207.0	4000.0	male	2009
340	Chinstrap	Dream	43.5	18.1	202.0	3400.0	female	2009
341	Chinstrap	Dream	49.6	18.2	193.0	3775.0	male	2009
342	Chinstrap	Dream	50.8	19.0	210.0	4100.0	male	2009
343	Chinstrap	Dream	50.2	18.7	198.0	3775.0	female	2009

344 rows × 8 columns

How to visualize counts?#

Pandas

plt.figure(figsize=(6, 6))
df.species.value_counts().plot(kind="bar");

../_images/95f13fb66821ed36d730a11c9a3a454d127aff0abfb4577bdafdc550e944b71e.png

Seaborn

sns.catplot(data=df, x="species", kind="count", palette=penguin_color);

../_images/ab886059541f40cca83212b1edebb49845ef0080f314ff9f4a549fbe798959af.png

How to visualize proportions#

Seaborn

df.add_column("x", "").pipe(
    lambda df: (
        sns.displot(data=df, x="x", hue="species", multiple="fill", palette=penguin_color)
    )
);

../_images/d59cba64c1475a939fc47c037d45b8245f04b4c42b540317a4ad7288c757eadd.png

Measures of central tendency#

Mean#

for all the variables

df.mean()

/tmp/ipykernel_1302/3698961737.py:1: FutureWarning: The default value of numeric_only in DataFrame.mean is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.mean()

bill_length_mm         43.921930
bill_depth_mm          17.151170
flipper_length_mm     200.915205
body_mass_g          4201.754386
year                 2008.029070
dtype: float64

A penguin is around 4 kg LOLLLLLL

Only one variable

df.bill_depth_mm.mean()

17.151169590643274

Median#

df.median()

/tmp/ipykernel_1302/530051474.py:1: FutureWarning: The default value of numeric_only in DataFrame.median is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.median()

bill_length_mm         44.45
bill_depth_mm          17.30
flipper_length_mm     197.00
body_mass_g          4050.00
year                 2008.00
dtype: float64

Mode#

df.mode()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
0	Adelie	Biscoe	41.1	17.0	190.0	3800.0	male	2009

Works for categorical too

Measures of dispersion#

Maximum value for each variable?#

df.max(numeric_only=True)           #numeric_only=True excludes the categorical variables(makes sense)

bill_length_mm         59.6
bill_depth_mm          21.5
flipper_length_mm     231.0
body_mass_g          6300.0
year                 2009.0
dtype: float64

Minimum value for each variable?#

df.min(numeric_only=True)

bill_length_mm         32.1
bill_depth_mm          13.1
flipper_length_mm     172.0
body_mass_g          2700.0
year                 2007.0
dtype: float64

Range of each variable#

df.max(numeric_only=True) - df.min(numeric_only=True)

bill_length_mm         27.5
bill_depth_mm           8.4
flipper_length_mm      59.0
body_mass_g          3600.0
year                    2.0
dtype: float64

Standart deviation#

df.std()

/tmp/ipykernel_1302/3390915376.py:1: FutureWarning: The default value of numeric_only in DataFrame.std is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.std()

bill_length_mm         5.459584
bill_depth_mm          1.974793
flipper_length_mm     14.061714
body_mass_g          801.954536
year                   0.818356
dtype: float64

Interquartile range?#

The IQR equals Q3-Q1, and it holds the 50% of the data around the mean

Q1

df.quantile(0.25)

/tmp/ipykernel_1302/3656653379.py:1: FutureWarning: The default value of numeric_only in DataFrame.quantile is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.quantile(0.25)

bill_length_mm         39.225
bill_depth_mm          15.600
flipper_length_mm     190.000
body_mass_g          3550.000
year                 2007.000
Name: 0.25, dtype: float64

Q3

df.quantile(0.75)

/tmp/ipykernel_1302/3799946287.py:1: FutureWarning: The default value of numeric_only in DataFrame.quantile is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.quantile(0.75)

bill_length_mm         48.5
bill_depth_mm          18.7
flipper_length_mm     213.0
body_mass_g          4750.0
year                 2009.0
Name: 0.75, dtype: float64

IQR

(
    df
    .quantile(q=[0.75, 0.50, 0.25])
    .transpose()
    .rename_axis("variable")
    .reset_index()
    .assign(
        iqr=lambda df: df[0.75] - df[0.25]
    )
)

/tmp/ipykernel_1302/4097684495.py:2: FutureWarning: The default value of numeric_only in DataFrame.quantile is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  df

	variable	0.75	0.5	0.25	iqr
0	bill_length_mm	48.5	44.45	39.225	9.275
1	bill_depth_mm	18.7	17.30	15.600	3.100
2	flipper_length_mm	213.0	197.00	190.000	23.000
3	body_mass_g	4750.0	4050.00	3550.000	1200.000
4	year	2009.0	2008.00	2007.000	2.000

Probabilities#

Probability Mass function#

Probability for each value

empiricaldist.Pmf.from_seq(df.flipper_length_mm)

	probs
172.0	0.002924
174.0	0.002924
176.0	0.002924
178.0	0.011696
179.0	0.002924
180.0	0.014620
181.0	0.020468
182.0	0.008772
183.0	0.005848
184.0	0.020468
185.0	0.026316
186.0	0.020468
187.0	0.046784
188.0	0.017544
189.0	0.020468
190.0	0.064327
191.0	0.038012
192.0	0.020468
193.0	0.043860
194.0	0.014620
195.0	0.049708
196.0	0.029240
197.0	0.029240
198.0	0.023392
199.0	0.017544
200.0	0.011696
201.0	0.017544
202.0	0.011696
203.0	0.014620
205.0	0.008772
206.0	0.002924
207.0	0.005848
208.0	0.023392
209.0	0.014620
210.0	0.040936
211.0	0.005848
212.0	0.020468
213.0	0.017544
214.0	0.017544
215.0	0.035088
216.0	0.023392
217.0	0.017544
218.0	0.014620
219.0	0.014620
220.0	0.023392
221.0	0.014620
222.0	0.017544
223.0	0.005848
224.0	0.008772
225.0	0.011696
226.0	0.002924
228.0	0.011696
229.0	0.005848
230.0	0.020468
231.0	0.002924

Then, if i want to know that is the specific probability of getting a value:

empiricaldist.Pmf.from_seq(df.flipper_length_mm)(190)

0.06432748538011696

Empirical Cumulative Distribution Function#

This one returns the cumulative probability based on the dataset

plt.figure(figsize=(6, 6))
sns.ecdfplot(data=df, x="flipper_length_mm");

../_images/66e412ca0c0039af3855304eba8e9595da02ccee3118c77a3563d3a39f37b13e.png

Probabilities of getting a penguin with 200 or less flipper_length?

empiricaldist.Cdf.from_seq(df.flipper_length_mm)(200)

array(0.56725146)

See the chart below for a better interpretation

plt.figure(figsize=(6, 6))

empiricaldist.Cdf.from_seq(df.flipper_length_mm).plot()

q=200
p = empiricaldist.Cdf.from_seq(df.flipper_length_mm)(200)

plt.vlines(
    x=q,
    ymin=0,
    ymax=p,
    color="black",
    linestyle="dashed"
)

plt.hlines(
    y=p,
    xmin=empiricaldist.Cdf.from_seq(df.flipper_length_mm).qs[0],
    xmax=q,
    color="black",
    linestyle="dashed"
)

plt.plot(q, p, "ro");

../_images/566f3bb0680c821b4f0cf0ab24939fe168c47eb97239a39bc9379956fdbc70b8.png

Correlations#

Pairplot#

In this case, please remove the year column, despite being a numeric column, it is not adding value to the correlation analysis

sns.pairplot(data=df.drop(["year"], axis=1));

../_images/f1733658e2825886bb3cb71846c0d812f656ec524aef4c650236310893bcc52f.png

Pairplot per specie#

sns.pairplot(data=df.drop(["year"], axis=1), hue="species", palette=penguin_color);

../_images/d21836576320161a18823e6a8a04f086c1e4ddb634edc3d7c9536f0d5a1f04b3.png

Swarmplot#

This one can be used to associate a categorical variable with a numerical:

plt.figure(figsize=(6, 6))
sns.swarmplot(
    data=df,
    x="island",
    y="body_mass_g",
    hue="species",
    palette=penguin_color
);

../_images/7d307b7a215533b707cd6761ce527f991b4c9a0f7d67f786c7f9c1e425b86c41.png

There doesn’t seem to be a relation among the penguin body mass and the island where it is located
there is a relation among the body mass and the specie

Is there a linear correlation among any of our variables?#

df.drop(["year"], axis=1).corr()          #first remove the year column and then run

/tmp/ipykernel_1302/2116061803.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.drop(["year"], axis=1).corr()          #first remove the year column and then run

	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
bill_length_mm	1.000000	-0.235053	0.656181	0.595110
bill_depth_mm	-0.235053	1.000000	-0.583851	-0.471916
flipper_length_mm	0.656181	-0.583851	1.000000	0.871202
body_mass_g	0.595110	-0.471916	0.871202	1.000000

Flipper length and body mass have a 0.8712 correlation coef, high

let’s visualize that into a matrix

plt.figure(figsize=(6, 6))
sns.heatmap(
    data=df.drop(["year"], axis=1).corr(),
    cmap=sns.diverging_palette(20, 230, as_cmap=True),        #nice color palette
    center=0,                                                 #centering the color legend
    vmin=-1,                                                  #set min corr value to -1
    vmax=1,                                                   #set max corr value to 1
    linewidths=0.5,
    annot=True
)

/tmp/ipykernel_1302/2892716928.py:3: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  data=df.drop(["year"], axis=1).corr(),

<AxesSubplot: >

../_images/7c6a1d51571eb247cac0ec80c3064cf9033379a0e5e2435dc1c1c4594c45fa4f.png

sns.clustermap(
    data=df.drop(["year"], axis=1).corr(),
    cmap=sns.diverging_palette(20, 230, as_cmap=True),        #nice color palette
    center=0,                                                 #centering the color legend
    vmin=-1,                                                  #set min corr value to -1
    vmax=1,                                                   #set max corr value to 1
    linewidths=0.5,
    annot=True,
    figsize=(6,6)
);

/tmp/ipykernel_1302/2652653307.py:2: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  data=df.drop(["year"], axis=1).corr(),

../_images/7a42e74d9f3a6816de77daa489665f46bade5d15c6960f5ce15ff814db497624.png

How can i represent a categorical variable as a discrete numerical?#

Let’s do it with sex, set female to 0 and male to 1

df = df.assign(
    numeric_sex=lambda df : df.sex.replace(["female", "male"], [0, 1])
)

df_clean = df.dropna()

sns.clustermap(
    data=df.drop(["year"], axis=1).corr(),
    cmap=sns.diverging_palette(20, 230, as_cmap=True),        #nice color palette
    center=0,                                                 #centering the color legend
    vmin=-1,                                                  #set min corr value to -1
    vmax=1,                                                   #set max corr value to 1
    linewidths=0.5,
    annot=True,
    figsize=(6,6)
);

/tmp/ipykernel_1302/2652653307.py:2: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  data=df.drop(["year"], axis=1).corr(),

../_images/b0091202a69611d0de171f9e55703ea059da9302b61c0f3ffa6f8ad3680b6088.png

Look like linear correlation is low among sex and the other variables

Multiple regression#

I forgot the scale to weigh the penguins, what is the best way to capture that data?

Let’s create many linear regression models and compare them

Use the df_clean, which contains the dataset without null values, just to avoid errors

Model 1#

let’s create a model to predict the “body_mass” using the variable “bill_depth”

#Creating the model

model_1 = (
    smf.ols(
        formula="body_mass_g ~ bill_length_mm",
        data=df_clean
    )
    .fit()
)

#Describing the model

model_1.summary()

OLS Regression Results
Dep. Variable:	body_mass_g	R-squared:	0.347
Model:	OLS	Adj. R-squared:	0.345
Method:	Least Squares	F-statistic:	176.2
Date:	Mon, 09 Jan 2023	Prob (F-statistic):	1.54e-32
Time:	21:34:02	Log-Likelihood:	-2629.1
No. Observations:	333	AIC:	5262.
Df Residuals:	331	BIC:	5270.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	388.8452	289.817	1.342	0.181	-181.271	958.961
bill_length_mm	86.7918	6.538	13.276	0.000	73.931	99.652

Omnibus:	6.141	Durbin-Watson:	0.849
Prob(Omnibus):	0.046	Jarque-Bera (JB):	4.899
Skew:	-0.197	Prob(JB):	0.0864
Kurtosis:	2.555	Cond. No.	360.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Model 2#

Now let’s add a second variable “bill_depth_mm”

#Creating the model

model_2 = (
    smf.ols(
        formula="body_mass_g ~ bill_length_mm + bill_depth_mm",
        data=df_clean
    )
    .fit()
)

#Describing the model

model_2.summary()

OLS Regression Results
Dep. Variable:	body_mass_g	R-squared:	0.467
Model:	OLS	Adj. R-squared:	0.464
Method:	Least Squares	F-statistic:	144.8
Date:	Mon, 09 Jan 2023	Prob (F-statistic):	7.04e-46
Time:	21:34:02	Log-Likelihood:	-2595.2
No. Observations:	333	AIC:	5196.
Df Residuals:	330	BIC:	5208.
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	3413.4519	437.911	7.795	0.000	2552.002	4274.902
bill_length_mm	74.8126	6.076	12.313	0.000	62.860	86.765
bill_depth_mm	-145.5072	16.873	-8.624	0.000	-178.699	-112.315

Omnibus:	2.839	Durbin-Watson:	1.798
Prob(Omnibus):	0.242	Jarque-Bera (JB):	2.175
Skew:	-0.000	Prob(JB):	0.337
Kurtosis:	2.604	Cond. No.	644.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Model 3#

Now we can add the “flipper_length_mm” and see what happens

#Creating the model

model_3 = (
    smf.ols(
        formula="body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm",
        data=df_clean
    )
    .fit()
)

#Describing the model

model_3.summary()

OLS Regression Results
Dep. Variable:	body_mass_g	R-squared:	0.764
Model:	OLS	Adj. R-squared:	0.762
Method:	Least Squares	F-statistic:	354.9
Date:	Mon, 09 Jan 2023	Prob (F-statistic):	9.26e-103
Time:	21:34:02	Log-Likelihood:	-2459.8
No. Observations:	333	AIC:	4928.
Df Residuals:	329	BIC:	4943.
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-6445.4760	566.130	-11.385	0.000	-7559.167	-5331.785
bill_length_mm	3.2929	5.366	0.614	0.540	-7.263	13.849
bill_depth_mm	17.8364	13.826	1.290	0.198	-9.362	45.035
flipper_length_mm	50.7621	2.497	20.327	0.000	45.850	55.675

Omnibus:	5.596	Durbin-Watson:	1.982
Prob(Omnibus):	0.061	Jarque-Bera (JB):	5.469
Skew:	0.312	Prob(JB):	0.0649
Kurtosis:	3.068	Cond. No.	5.44e+03

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.44e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Model 4#

#Creating the model

model_4 = (
    smf.ols(
        formula="body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + C(sex)",
        data=df_clean
    )
    .fit()
)

#Describing the model

model_4.summary()

OLS Regression Results
Dep. Variable:	body_mass_g	R-squared:	0.823
Model:	OLS	Adj. R-squared:	0.821
Method:	Least Squares	F-statistic:	381.3
Date:	Mon, 09 Jan 2023	Prob (F-statistic):	6.28e-122
Time:	21:34:02	Log-Likelihood:	-2411.8
No. Observations:	333	AIC:	4834.
Df Residuals:	328	BIC:	4853.
Df Model:	4
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-2288.4650	631.580	-3.623	0.000	-3530.924	-1046.006
C(sex)[T.male]	541.0285	51.710	10.463	0.000	439.304	642.753
bill_length_mm	-2.3287	4.684	-0.497	0.619	-11.544	6.886
bill_depth_mm	-86.0882	15.570	-5.529	0.000	-116.718	-55.459
flipper_length_mm	38.8258	2.448	15.862	0.000	34.011	43.641

Omnibus:	2.598	Durbin-Watson:	1.843
Prob(Omnibus):	0.273	Jarque-Bera (JB):	2.125
Skew:	0.062	Prob(JB):	0.346
Kurtosis:	2.629	Cond. No.	7.01e+03

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.01e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Model 5#

Let’s now try with the “flipper_length_mm” alone

#Creating the model

model_5 = (
    smf.ols(
        formula="body_mass_g ~ flipper_length_mm + C(sex)",
        data=df_clean
    )
    .fit()
)

#Describing the model

model_5.summary()

OLS Regression Results
Dep. Variable:	body_mass_g	R-squared:	0.806
Model:	OLS	Adj. R-squared:	0.805
Method:	Least Squares	F-statistic:	684.8
Date:	Mon, 09 Jan 2023	Prob (F-statistic):	3.53e-118
Time:	21:34:02	Log-Likelihood:	-2427.2
No. Observations:	333	AIC:	4860.
Df Residuals:	330	BIC:	4872.
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-5410.3002	285.798	-18.931	0.000	-5972.515	-4848.085
C(sex)[T.male]	347.8503	40.342	8.623	0.000	268.491	427.209
flipper_length_mm	46.9822	1.441	32.598	0.000	44.147	49.817

Omnibus:	0.262	Durbin-Watson:	1.710
Prob(Omnibus):	0.877	Jarque-Bera (JB):	0.376
Skew:	0.051	Prob(JB):	0.829
Kurtosis:	2.870	Cond. No.	2.95e+03

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.95e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Note that only one variable explains the 76% of the variance.

You decide if sticking with one variable and getting 76% of variance, or stick with 4 and get 80%.

Multiple Regression Model Comparisons#

Creation of a results table#

model_results = pd.DataFrame(
    dict(
        actual_value = df_clean.body_mass_g,
        prediction_model_1 = model_1.predict(),
        prediction_model_2 = model_2.predict(),
        prediction_model_3 = model_3.predict(),
        prediction_model_4 = model_4.predict(),
        prediction_model_5 = model_5.predict(),
        species = df_clean.species,
        sex = df_clean.sex
    )
)

model_results

	actual_value	prediction_model_1	prediction_model_2	prediction_model_3	prediction_model_4	prediction_model_5	species	sex
0	3750.0	3782.402961	3617.641192	3204.761227	3579.136946	3441.323750	Adelie	male
1	3800.0	3817.119665	3836.725580	3436.701722	3343.220772	3328.384372	Adelie	female
2	3250.0	3886.553073	3809.271371	3906.897032	3639.137335	3751.223949	Adelie	female
4	3450.0	3574.102738	3350.786581	3816.705772	3457.954243	3657.259599	Adelie	female
5	3650.0	3799.761313	3356.140070	3696.168128	3764.536023	3864.163327	Adelie	male
...	...	...	...	...	...	...	...	...
339	4000.0	5231.825347	4706.954140	4599.187485	4455.022405	4662.860306	Chinstrap	male
340	3400.0	4164.286703	4034.121055	4274.552753	3894.857519	4080.099176	Chinstrap	female
341	3775.0	4693.716437	4475.927353	3839.563668	4063.639819	4005.109853	Chinstrap	male
342	4100.0	4797.866549	4449.296758	4720.740455	4652.013882	4803.806832	Chinstrap	male
343	3775.0	4745.791493	4448.061337	4104.268240	3672.299099	3892.170475	Chinstrap	female

333 rows × 8 columns

Comparing predicted values vs Actual values (ECDFs)#

plt.figure(figsize=(6, 6))
sns.ecdfplot(
    data=model_results
);

../_images/6c34c284d66e96f51964e87ade83e43b2c7a68bd45c8c4d7e6e9333889710538.png

Now, note that the BLUE LINE ARE THE ACTUAL VALUES

Therefore, the line that is closer to the model, would be the better model

Comparing predicted values distributions vs Original values distributions(Kdeplot)#

plt.figure(figsize=(6, 6))
sns.kdeplot(
    data=model_results
);

../_images/effe5cb70103f43b577ad222a453df68906b32e8cdea657891e12a32fa9c5cc6.png

Looks like the purple line (model 4) fits better, but that contains 4 variables.

The model 5 also fist really great with only two variables

Bonus: Pandas Profiling Report#

import palmerpenguins
import pandas as pd
import pandas_profiling as pp

df = palmerpenguins.load_penguins()

pp.ProfileReport(df)

Created in Deepnote

Exploratory Data Analysis Basics

Contents

Exploratory Data Analysis Basics#

EDA Definition#

Razones para hacer un EDA#

Pasos de EDA#

Tipos de Analitica de datos#

Project#

Environment Setup#

Library installation#

Library import#

Stablish Chart Settings#

Importing Dataset#

Raw Data#

processed data#

Using the seaborn dataset#

Performing EDA#

What are the variables types for this dataset?#

How many variables of each type is present in the dataset?#

How many variables and records are in the dataset?#

Are there null values in the dataset?#

How many null values are there in each column?#

What is the total amount of null values in the dataset?#

What is the null values proportion for each variable?#

What are the records that contain the null values in the dataset? (Visualize)#

How many records will be lost if we delete all the missing values?#

Counts and Proportions#

What measures describe our dataset?#

BONUS: Transform an object into a category#

How to visualize counts?#

How to visualize proportions#

Measures of central tendency#

Mean#

Median#

Mode#

Measures of dispersion#

Maximum value for each variable?#

Minimum value for each variable?#

Range of each variable#

Standart deviation#

Interquartile range?#

Probabilities#

Probability Mass function#

Empirical Cumulative Distribution Function#

Correlations#

Pairplot#

Pairplot per specie#

Swarmplot#

Is there a linear correlation among any of our variables?#

How can i represent a categorical variable as a discrete numerical?#

Multiple regression#

Model 1#

Model 2#

Model 3#

Model 4#

Model 5#

Multiple Regression Model Comparisons#

Creation of a results table#

Comparing predicted values vs Actual values (ECDFs)#

Comparing predicted values distributions vs Original values distributions(Kdeplot)#

Bonus: Pandas Profiling Report#